Learning to use code to manipulate large datasets and conduct data analysis is becoming a critical skill for many professionals in the field of environmental sciences. This practical and tutorial will teach you the basics of using R to conduct data analysis using a long-term dataset of surface water quality and to interpret the results, linking real data to course theory.
You are expected to use your own computer for all exercises. By the end of this practical, you will be able to:
Here I present a tutorial that will guide you through the use of different R functions. I will walk you through the tutorial during the practical. You can use this tutorial as a starting point for using R for data analysis in the future.
The tutorial is organized in 5 main sections.
R is a very powerful statistical programming language that is used broadly by researchers around the world. R is an attractive programming language because it is free, open source, and platform independent. With all the libraries that are available (and those that are in rapid development), it is quickly becoming a one-stop shop for all your analysis needs. Most academic statisticians now use R, which has allowed for greater sharing of R code or packages to implement their recommended methods. One of the very first things academics often ask when hiring someone is simply, “Can you describe your R or statistical programming experience?” It is now critical to have this background to be competitive for scientific (and many other) positions.
Among the reasons to use R you have:
R does have a steep learning curve that can often be intimidating to new users, particularly those without prior coding experience. While this can be very frustrating in the initial stages, learning R is like learning a language where proficiency requires practice and continual use of the program.
Our advice is to push yourself to use this tool in everything you do. At first, R will not be the easiest or quickest option. With persistence, however, you will see the benefit of R and continue to find new ways to use it in your work.
R is available for Linux, MacOS X, and Windows (95 or later) platforms. Software can be downloaded from one of the Comprehensive R Archive Network (CRAN) mirror sites. Once installed, R will open a console where you run code. You can also work on a script file, where you can write and save your work, and other windows that will show up on demand such as the plot tab (Fig. 1).
RStudio is an enterprise-ready professional software tool that integrates with R. It has some nice features beyond the normal R interface, and many users find it easier to use than R alone (Fig. 2). Once you have installed R, you should also download and install RStudio. For this course, we will work exclusively in RStudio.
Make sure you have installed R and RStudio.
The last thing you need is the data we will use. To make sure we are all organised, first create a folder on your computer for this project. Name it BL3009.
Now, inside this folder, create a new folder called Data.
Download the data from this drive and paste it inside the Data folder. There are two files, China_dataset.csv and Ireland_dataset.csv.
You are now ready for class :)
There are a few concepts that are important to keep in mind before
you start coding. The fact that R is a programming
language may deter some users who think “I can’t program”. This should
not be the case for two reasons. First, R is an
interpreted language, not a compiled one, meaning that all commands
typed on the keyboard are directly executed without requiring you to
build a complete program like in most computer languages (C, Pascal,
...). Second, R’s syntax is very simple and intuitive.
For instance, a linear regression can be done with the command
lm(y ~ x) which means fitting a linear model with y as the
response and x as a predictor.
In R, in order to be executed, a function always
needs to be written with parentheses, even if there is nothing within
them (e.g., ls()). If you type the name of a function
without parentheses, R will display the content of the
function.
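As a minimal sketch of the two points above, the snippet below fits a linear model and then contrasts calling a function with typing only its name. The vectors `x` and `y` are made-up numbers used purely for illustration; they are not part of the course data.

```r
# Toy data invented for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# Fit a linear model with y as the response and x as the predictor
fit <- lm(y ~ x)
coef(fit)   # intercept and slope of the fitted line

# With parentheses, the function is executed
ls()        # lists the objects currently in your workspace

# Without parentheses, R prints the definition of the function instead
ls
```

Running `summary(fit)` would give a fuller report of the fitted model.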
When R is running, variables, data, functions, results, etc., are stored in the active memory of the computer in the form of objects to which you assign a name. The user can act on these objects with operators (arithmetic, logical, comparison, ...) and functions (which are themselves objects).
The name of an object must start with a letter (A-Z or a-z) and can be followed by letters, digits (0-9), dots (.), and underscores (_).
When referring to the directory of a folder or a data file, R uses forward slash “/”. You need to pay close attention to the direction of the slash if you copy a file path or directory from a Windows machine.
It is also important to know that R discriminates between uppercase and lowercase letters in the names of objects, so that x and X can name two distinct objects (even under Windows).
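The naming and path rules above can be tried out with a few lines. This is only a sketch: the object names and the folder path are illustrative assumptions, and nothing is read from or written to disk.

```r
# x and X are two different objects because R is case sensitive
x <- 10
X <- "ten"
x
X

# A valid name: starts with a letter, then letters, digits, dots, underscores
water_temp.2001 <- 17.0

# make.names() shows how R would repair a name that starts with a digit
make.names("2001_temp")

# file.path() builds a path using forward slashes
file.path("C:", "Documents", "R_Practice")

# A path copied from Windows uses backslashes, which must be doubled inside
# an R string; gsub() can convert them to forward slashes
win_path <- "C:\\Documents\\R_Practice"
r_path <- gsub("\\", "/", win_path, fixed = TRUE)
r_path
```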
Like in many other programs, you should start your session by defining your working directory - the folder where you will work. This will be the location on your computer where any files you save will be located. To determine your current working directory, type:
getwd()
Use setwd() to change or set a new working directory.
For instance, you can set your working directory to be in your Documents
folder on the C: drive, or in any folder you prefer.
setwd("C:/Documents/R_Practice")
There are three fundamental data types in R that you will work with in this practical:
You can check the data type of an object using the function
class(). To convert between data types you can use:
as.integer(), as.numeric(),
as.logical(), as.character().
For instance:
city <- 'Beijing'
class(city)
## [1] "character"
number <- 3.4
class(number)
## [1] "numeric"
Integer <- as.integer(number)
Integer
## [1] 3
class(Integer)
## [1] "integer"
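A few more conversions are sketched below, including what happens when a value cannot be converted. The values are arbitrary examples chosen for illustration.

```r
as.numeric("3.4")     # character to numeric
as.character(42)      # numeric to character
as.logical("TRUE")    # character to logical
as.integer(3.9)       # drops the decimal part, it does not round

# Text that cannot be read as a number becomes NA (R also issues a warning,
# which is silenced here to keep the output clean)
bad <- suppressWarnings(as.numeric("three"))
is.na(bad)
```

Checking for `NA` after a conversion is a quick way to spot values that were not in the format you expected.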
Since R is a programming language, we can store information as objects to avoid unnecessary repetition. Note again that object names are case sensitive; ‘x’ is not the same as ‘X’!
city <- "Cork"
summary(city)
number <- 2
summary(number)
character <- as.character(2)
character
Data are very often stored in different folders to maintain an organizational pattern in your projects. In those cases, it is not necessary to re-set the working directory every time we want to import files to R that are stored in different folders, as long as these folders are within the root directory you have previously set. For instance, let’s say you have a table stored in a folder called data, which is a subfolder within your root working directory (C:/Documents/R_Practice). You can point to the data folder when reading the table as in the example below:
table <- read.csv(file="./data/TheDataIWantToReadIn.csv", header=TRUE) # read a csv table stored in the data folder
Note that because data is a subfolder in your root directory, you do not need to provide the complete directory information when reading the table “./data/TheDataIWantToReadIn.csv”. You can always provide the full directory of a data file stored on your local drive to avoid confusion.
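To try this without touching your own files, the sketch below writes a tiny made-up table into a `data` subfolder of a temporary directory and reads it back. The file name `toy_table.csv` and its contents are assumptions for illustration only.

```r
# Use a temporary directory as a stand-in for the project root folder
root <- tempdir()
dir.create(file.path(root, "data"), showWarnings = FALSE)

# Write a small, invented table so that there is something to read back
toy <- data.frame(site = c("A", "B"), ph = c(7.3, 8.1))
csv_path <- file.path(root, "data", "toy_table.csv")
write.csv(toy, csv_path, row.names = FALSE)

# file.exists() is a quick check when a path does not seem to work
file.exists(csv_path)

# Read the table from the subfolder
toy_in <- read.csv(file = csv_path, header = TRUE)
toy_in
```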
The # character is used to add comments to your code. # indicates the beginning of a comment and everything after # on a line will be ignored and not run as code. Adding comments to your code is considered good practice because it allows you to describe in plain language (for yourself or others) what your code is doing.
#This is a comment
Vectors are a basic data structure in R. They
contain a sequence of data and can contain characters, numbers, or be
TRUE/FALSE values. Remember: If you are unsure or need help, use the
help function (e.g., help(seq) or ?seq).
Below are several ways to create vectors in R.
1:20
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
c(1,2,3,4,5)
## [1] 1 2 3 4 5
seq(0,100,by=10)
## [1] 0 10 20 30 40 50 60 70 80 90 100
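A few further ways of building and using vectors are sketched here. The temperature values are arbitrary numbers chosen for illustration.

```r
seq(0, 1, length.out = 5)     # five evenly spaced values from 0 to 1
rep(c("a", "b"), times = 2)   # repeat a character vector

temps <- c(17.0, 18.6, 19.5, 23.6)
warm <- temps > 19            # a logical vector: one TRUE/FALSE per element
warm
temps * 2                     # arithmetic is applied to every element
length(temps)                 # number of elements

# Mixing types in one vector converts everything to a single type
mixed <- c(1, "a", TRUE)
class(mixed)
```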
Matrices and dataframes are common ways to store tabular data. Understanding how to manipulate them is important to be able to conduct more complex analyses. Both matrices and dataframes are composed of rows and columns. The main difference between matrices and dataframes is that dataframes can contain many different classes of data (numeric, character, etc.), while matrices can only contain a single class.
Create a matrix with 4 rows and 5 columns using the values 1 to 20,
stored in a vector called x. Consult the help (e.g., help(matrix)
or ?matrix) to determine the syntax required.
x <- 1:20
test_matrix <- matrix(data = x, nrow = 4, ncol = 5)
test_matrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 5 9 13 17
## [2,] 2 6 10 14 18
## [3,] 3 7 11 15 19
## [4,] 4 8 12 16 20
# Note, I can assign any name to an object that I create. Generally it is best to name things in a way that is meaningful.
Now, if we wanted to reference any value in the matrix, we could do so with matrix notation. The first value in matrix notation references the row and the second value references the column. COMMIT THIS TO MEMORY! I remember this by thinking Roman Catholic. So, if you wanted to view only the value in the 1st row, 5th column, you’d type:
# test_matrix[row, column]
test_matrix[1,5]
## [1] 17
In addition to using positive integers to indicate the exact location of the subset of data we want to extract, you can also use other notation to indicate subsets of data that you want to include or exclude. You can use: negative integers (to exclude data at a specific location), zero (to create empty objects with consistent format), blank spaces (to select the entire row/column), logical values (to select the data associated with TRUE values), or names (to select specific columns or rows by their names). Try to understand how each type of notation works!
For example, what if you wanted to view all the values in the 5th column? This literally says, extract all rows but only the 5th column from the object called test_matrix.
test_matrix[,5]
## [1] 17 18 19 20
What about the 4th row?
test_matrix[4,]
## [1] 4 8 12 16 20
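The other index types mentioned earlier (negative integers, logical values, and names) can be sketched on the same matrix. The column names `a` to `e` are added here only for illustration.

```r
test_matrix <- matrix(1:20, nrow = 4, ncol = 5)

# Negative integer: everything except the first row
test_matrix[-1, ]

# Logical values: keep only the columns marked TRUE (columns 1 and 5)
test_matrix[, c(TRUE, FALSE, FALSE, FALSE, TRUE)]

# Logical condition: rows where the first column is greater than 2
test_matrix[test_matrix[, 1] > 2, ]

# Names: give the columns names, then select by name
colnames(test_matrix) <- c("a", "b", "c", "d", "e")
test_matrix[, "e"]
test_matrix[2, c("a", "b")]
```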
What happens to the matrix if we append a character field? Use the
cbind() (column bind) command to bind a new column, called
‘countries’. Note that I am not changing the contents of test_matrix.
Can you figure out how to do a row bind (hint: use
rbind())?
countries <- c("United States", "Pakistan", "Ireland", "China")
cbind(test_matrix,countries)
## countries
## [1,] "1" "5" "9" "13" "17" "United States"
## [2,] "2" "6" "10" "14" "18" "Pakistan"
## [3,] "3" "7" "11" "15" "19" "Ireland"
## [4,] "4" "8" "12" "16" "20" "China"
#Note that I am not changing/overwriting the contents of test_matrix. I could, but I'd have to change my code to
#test_matrix <- cbind(test_matrix,countries)
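One possible way to do the row bind is sketched below. The new row values are arbitrary; the second part shows that binding character data converts the whole matrix to character, just as with cbind().

```r
test_matrix <- matrix(1:20, nrow = 4, ncol = 5)

# Bind a new numeric row to the bottom of the matrix
new_row <- 21:25
bigger <- rbind(test_matrix, new_row)
bigger
dim(bigger)

# Binding a character row converts every value to character
mixed <- rbind(test_matrix, c("a", "b", "c", "d", "e"))
typeof(mixed)
```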
Why is everything inside the table now enclosed in quotes? Recall what we said about matrices only containing one data type. What happens if I coerce this to a dataframe?
test_dataframe <- data.frame(test_matrix,countries)
test_dataframe
## X1 X2 X3 X4 X5 countries
## 1 1 5 9 13 17 United States
## 2 2 6 10 14 18 Pakistan
## 3 3 7 11 15 19 Ireland
## 4 4 8 12 16 20 China
# Have I changed the class of the object?
class(test_dataframe)
## [1] "data.frame"
Can I rename the column headings?
names(test_dataframe) <- c("Val1", "Val2", "Val3", "Val4", "Val5", "Countries")
test_dataframe
## Val1 Val2 Val3 Val4 Val5 Countries
## 1 1 5 9 13 17 United States
## 2 2 6 10 14 18 Pakistan
## 3 3 7 11 15 19 Ireland
## 4 4 8 12 16 20 China
Can I use the same matrix notation to reference a particular row and column? Are there other ways to reference a value?
test_dataframe[3,5]
## [1] 19
test_dataframe[,5]
## [1] 17 18 19 20
test_dataframe$Val5[3]
## [1] 19
test_dataframe$Val5
## [1] 17 18 19 20
test_dataframe[,"Val5"]
## [1] 17 18 19 20
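Two more ways to reference values in a dataframe are double brackets and logical conditions on a column. The sketch below rebuilds a smaller version of the dataframe (only three of its columns) so that it runs on its own.

```r
test_dataframe <- data.frame(
  Val1 = 1:4,
  Val5 = 17:20,
  Countries = c("United States", "Pakistan", "Ireland", "China")
)

# Double brackets return a single column as a vector
test_dataframe[["Val5"]]

# A logical condition in the row position keeps only matching rows
test_dataframe[test_dataframe$Countries == "Ireland", ]

# Combine a row condition with a column name
test_dataframe[test_dataframe$Val5 > 18, "Countries"]
```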
You can also use some very simple commands to determine the size of dataframes or matrices.
nrow(test_dataframe)
## [1] 4
ncol(test_dataframe)
## [1] 6
dim(test_dataframe)
## [1] 4 6
R functions can be defined as a set of statements organised together to carry out a specific task. Functions take input arguments, some of which are optional, and return a value. Custom functions can be easily constructed in R. Most often, however, we will use built-in functions within base packages or other downloadable packages.
Most functions have optional arguments that are given default values (in the function’s help document, under the ‘Usage’ section, the optional arguments are shown with their default value following the “=” symbol). When you don’t specify the optional arguments, they take their default values. Functions are normally called using the following format: function_name(input_data, argument1, argument2, ...)
print(2+2)
## [1] 4
x <- matrix(1:10, 5, 2)
x
## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10
y <- matrix(1:5)
y
## [,1]
## [1,] 1
## [2,] 2
## [3,] 3
## [4,] 4
## [5,] 5
df.example <- cbind(x, y)
df.example
## [,1] [,2] [,3]
## [1,] 1 6 1
## [2,] 2 7 2
## [3,] 3 8 3
## [4,] 4 9 4
## [5,] 5 10 5
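The examples above use built-in functions. As a sketch of a custom function with an optional argument, the hypothetical helper below converts temperatures from Celsius to Fahrenheit. The function name and its `digits` argument are invented for illustration.

```r
# Hypothetical helper: temp_c is required, digits is optional (default 1)
celsius_to_fahrenheit <- function(temp_c, digits = 1) {
  round(temp_c * 9 / 5 + 32, digits)
}

celsius_to_fahrenheit(17)               # uses the default digits = 1
celsius_to_fahrenheit(17, digits = 0)   # overrides the default
celsius_to_fahrenheit(c(17, 18.6))      # also works on a whole vector

# args() shows the arguments of any function and their default values
args(round)
```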
Typing ?function_name loads the help file for that function. Also note that any functions in non-base packages will require installing and loading that package.
Here, for example, we install and load a package named “ggplot2” that we will use for data visualization.
install.packages("ggplot2")
library(ggplot2)
R also contains many pre-existing functions in the
base software. Numeric functions include sum(),
mean(), sd(), min(),
max(), median(), range(),
quantile(), or summary(). Try a few of these
on the numeric vectors you have created.
sum(x)
## [1] 55
summary(x)
## V1 V2
## Min. :1 Min. : 6
## 1st Qu.:2 1st Qu.: 7
## Median :3 Median : 8
## Mean :3 Mean : 8
## 3rd Qu.:4 3rd Qu.: 9
## Max. :5 Max. :10
range(y)
## [1] 1 5
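Real datasets often contain missing values, and most of these numeric functions return NA unless told to ignore them. The sketch below uses a short made-up vector of pH values with one missing entry.

```r
ph <- c(8.6, 7.3, 6.8, NA, 7.4)

mean(ph)                  # NA, because one value is missing
mean(ph, na.rm = TRUE)    # ignore the missing value
sd(ph, na.rm = TRUE)      # standard deviation of the remaining values
quantile(ph, probs = c(0.25, 0.75), na.rm = TRUE)
sum(is.na(ph))            # how many values are missing
```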
R can be used to perform basic calculations and report the results back to the user.
4+2
## [1] 6
6*8
## [1] 48
(842-62)/3
## [1] 260
Exponentiation: ^
2^3
## [1] 8
Max and Min: max(), min()
vector_numbers <- c(2, 3, 4, 10)
max(vector_numbers)
## [1] 10
min(vector_numbers)
## [1] 2
Can you calculate the square root and then subtract 5 for each element in vector_numbers?
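One possible answer is sketched below. It relies on the fact that arithmetic and functions such as sqrt() are applied to every element of a vector, so no loop is needed.

```r
vector_numbers <- c(2, 3, 4, 10)

# sqrt() works element by element, and so does the subtraction
result <- sqrt(vector_numbers) - 5
result
```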
One of the most useful commands in R is
?. At the command prompt (signified by > in
your Console window), type ? followed by the name of any function and
a help tab for that function will open (e.g.,
?mean; Fig. 3). You can also search through the help tab
directly by typing function names in the search bar.
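Besides ?, base R offers a few other ways of finding help from the console. The sketch below runs two of them and lists the interactive ones as comments, since those open the help tab rather than printing a result.

```r
# Show the arguments of a function without opening the help page
args(mean)

# Find objects whose name matches a pattern
apropos("^mean")

# Interactive alternatives (uncomment to try them in RStudio):
# ?mean                  # help page for mean()
# help(mean)             # same as ?mean
# ??"standard deviation" # search all installed help pages for a phrase
# example(mean)          # run the examples from the help page
```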
The internet also contains a vast quantity of useful information. There are blogs, mailing lists, and various websites (e.g., https://stackoverflow.com/) dedicated to providing information about R, its packages, and potential error messages that you may encounter (among other things). The trick is usually determining the key terms to limit your search. I generally start any web-based search with “R-Cran”, which limits and focuses the search. Using “R” as part of your key terms does not, by itself, limit the search.
Now that you’ve learned the basics of R programming, we’ll take things a step further.
We’ll be working with a dataset published in the paper by Karim et al. 2025.
This is a comprehensive surface water quality dataset assembled from a range of regional and global water quality databases, water management organizations, and individual research projects from five countries: USA, Canada, Ireland, England, and China. We will now practice with the Chinese dataset. The goal of this exercise is to test your basic skills in R programming, specifically in manipulating data.
You may not be familiar with all the operations you need to execute in this exercise. Part of the goal with this exercise, however, is for you to become more familiar with the help commands in R and with the internet solutions that exist. Our ultimate goal is to make you aware of the tools that are available to you so that you can become an effective problem solver, working independently on data analyses.
Whenever you start working with a new script, you should first set a working directory. This directory will contain all the data for your analysis and will be where you will save all the data outputs.
Remember that you can check the current working directory by typing:
getwd()
## [1] "/Users/ramirocrego/Documents/GitHub/UCC_BL3009_Practical"
Now, let’s change the working directory to the BL3009 folder you created before class.
setwd("C:/..../BL3009")
Load the China_dataset csv file I provided you. For this we
will use the read.csv() function. Make sure the data is in
your working directory. Note that I have created a folder Data that
contains the csv file.
data <- read.csv("./Data/China_dataset.csv")
View the first 10 lines of the data set.
head(data, 10)
## Country Area Waterbody.Type Date Ammonia..mg.l.
## 1 China Hou Bay Bay 11-01-2001 3.5
## 2 China Hou Bay Bay 12-02-2001 6.7
## 3 China Hou Bay Bay 14-03-2001 4.5
## 4 China Hou Bay Bay 17-04-2001 5.4
## 5 China Hou Bay Bay 11-05-2001 3.3
## 6 China Hou Bay Bay 14-06-2001 4.5
## 7 China Hou Bay Bay 09-07-2001 1.6
## 8 China Hou Bay Bay 22-08-2001 2.6
## 9 China Hou Bay Bay 20-09-2001 2.8
## 10 China Hou Bay Bay 19-10-2001 4.2
## Biochemical.Oxygen.Demand..mg.l. Dissolved.Oxygen..mg.l.
## 1 0.7 8.2000
## 2 2.1 11.6841
## 3 0.3 11.6841
## 4 5.2 11.6841
## 5 1.7 11.6841
## 6 7.7 11.6841
## 7 1.4 4.5000
## 8 3.2 11.6841
## 9 4.2 1.8000
## 10 2.8 3.5000
## Orthophosphate..mg.l. pH..ph.units. Temperature..cel. Nitrogen..mg.l.
## 1 0.40 8.6 17.0 0.21
## 2 0.72 7.3 18.6 0.13
## 3 0.51 6.8 19.5 0.32
## 4 0.58 7.3 23.6 0.11
## 5 0.49 7.4 27.2 0.28
## 6 0.32 7.3 27.2 0.19
## 7 0.19 6.2 29.4 0.36
## 8 0.27 7.9 30.8 0.26
## 9 0.30 7.3 29.0 0.33
## 10 0.46 7.3 25.3 0.35
## Nitrate..mg.l. CCME_Values CCME_WQI
## 1 0.48 63.68388 Marginal
## 2 0.16 58.30881 Marginal
## 3 0.65 63.36902 Marginal
## 4 0.32 61.18353 Marginal
## 5 0.68 62.39702 Marginal
## 6 0.50 58.85799 Marginal
## 7 0.70 73.17806 Fair
## 8 0.33 68.12720 Fair
## 9 0.38 67.02537 Fair
## 10 0.25 61.49847 Marginal
Assess the overall structure of the data set to get a sense of the number and type of variables included. Make sure that the data type of each column of the data frame is correct and/or what you expect it to be.
str(data)
## 'data.frame': 45997 obs. of 14 variables:
## $ Country : chr "China" "China" "China" "China" ...
## $ Area : chr "Hou Bay" "Hou Bay" "Hou Bay" "Hou Bay" ...
## $ Waterbody.Type : chr "Bay" "Bay" "Bay" "Bay" ...
## $ Date : chr "11-01-2001" "12-02-2001" "14-03-2001" "17-04-2001" ...
## $ Ammonia..mg.l. : num 3.5 6.7 4.5 5.4 3.3 4.5 1.6 2.6 2.8 4.2 ...
## $ Biochemical.Oxygen.Demand..mg.l.: num 0.7 2.1 0.3 5.2 1.7 7.7 1.4 3.2 4.2 2.8 ...
## $ Dissolved.Oxygen..mg.l. : num 8.2 11.7 11.7 11.7 11.7 ...
## $ Orthophosphate..mg.l. : num 0.4 0.72 0.51 0.58 0.49 0.32 0.19 0.27 0.3 0.46 ...
## $ pH..ph.units. : num 8.6 7.3 6.8 7.3 7.4 7.3 6.2 7.9 7.3 7.3 ...
## $ Temperature..cel. : num 17 18.6 19.5 23.6 27.2 27.2 29.4 30.8 29 25.3 ...
## $ Nitrogen..mg.l. : num 0.21 0.13 0.32 0.11 0.28 0.19 0.36 0.26 0.33 0.35 ...
## $ Nitrate..mg.l. : num 0.48 0.16 0.65 0.32 0.68 0.5 0.7 0.33 0.38 0.25 ...
## $ CCME_Values : num 63.7 58.3 63.4 61.2 62.4 ...
## $ CCME_WQI : chr "Marginal" "Marginal" "Marginal" "Marginal" ...
Some variables appear as character, but we want them to be factors,
that is, categorical variables with levels. We can tell R to change the
data type from character to factor using the function
as.factor():
data$Country <- as.factor(data$Country)
data$Area <- as.factor(data$Area)
data$Waterbody.Type <- as.factor(data$Waterbody.Type)
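To see what a factor adds over a plain character vector, the sketch below uses a short made-up vector of water quality classes. The explicit level order in the last step is an assumption chosen for illustration.

```r
wqi <- c("Marginal", "Fair", "Marginal", "Good", "Fair", "Marginal")
class(wqi)

# Convert to a factor: the distinct values become levels
wqi_factor <- as.factor(wqi)
levels(wqi_factor)    # levels are sorted alphabetically by default
nlevels(wqi_factor)   # number of levels
table(wqi_factor)     # number of observations per level

# factor() lets you set the order of the levels yourself
wqi_ordered <- factor(wqi, levels = c("Marginal", "Fair", "Good"))
levels(wqi_ordered)
```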
Check the data structure again:
str(data)
## 'data.frame': 45997 obs. of 14 variables:
## $ Country : Factor w/ 1 level "China": 1 1 1 1 1 1 1 1 1 1 ...
## $ Area : Factor w/ 1 level "Hou Bay": 1 1 1 1 1 1 1 1 1 1 ...
## $ Waterbody.Type : Factor w/ 1 level "Bay": 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : chr "11-01-2001" "12-02-2001" "14-03-2001" "17-04-2001" ...
## $ Ammonia..mg.l. : num 3.5 6.7 4.5 5.4 3.3 4.5 1.6 2.6 2.8 4.2 ...
## $ Biochemical.Oxygen.Demand..mg.l.: num 0.7 2.1 0.3 5.2 1.7 7.7 1.4 3.2 4.2 2.8 ...
## $ Dissolved.Oxygen..mg.l. : num 8.2 11.7 11.7 11.7 11.7 ...
## $ Orthophosphate..mg.l. : num 0.4 0.72 0.51 0.58 0.49 0.32 0.19 0.27 0.3 0.46 ...
## $ pH..ph.units. : num 8.6 7.3 6.8 7.3 7.4 7.3 6.2 7.9 7.3 7.3 ...
## $ Temperature..cel. : num 17 18.6 19.5 23.6 27.2 27.2 29.4 30.8 29 25.3 ...
## $ Nitrogen..mg.l. : num 0.21 0.13 0.32 0.11 0.28 0.19 0.36 0.26 0.33 0.35 ...
## $ Nitrate..mg.l. : num 0.48 0.16 0.65 0.32 0.68 0.5 0.7 0.33 0.38 0.25 ...
## $ CCME_Values : num 63.7 58.3 63.4 61.2 62.4 ...
## $ CCME_WQI : chr "Marginal" "Marginal" "Marginal" "Marginal" ...
Do you notice the difference?
When we have dates in our data, we need to tell R that the data
should be read as dates and not characters. To do that, we will use the
function as.Date(). Note that you need to specify how the
date is formatted. There are multiple conventions, like day-month-year,
or month-day-year, etc. In our case, the dates are written as day, month,
and year (e.g., “11-01-2001”). In the format string, %d is the day, %m is
the month (lowercase; uppercase %M means minutes), and %Y is the four-digit year.
data$Date <- as.Date(data$Date, format = "%d-%m-%Y")
str(data)
## 'data.frame': 45997 obs. of 14 variables:
## $ Country : Factor w/ 1 level "China": 1 1 1 1 1 1 1 1 1 1 ...
## $ Area : Factor w/ 1 level "Hou Bay": 1 1 1 1 1 1 1 1 1 1 ...
## $ Waterbody.Type : Factor w/ 1 level "Bay": 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : Date, format: "2001-01-11" "2001-02-12" ...
## $ Ammonia..mg.l. : num 3.5 6.7 4.5 5.4 3.3 4.5 1.6 2.6 2.8 4.2 ...
## $ Biochemical.Oxygen.Demand..mg.l.: num 0.7 2.1 0.3 5.2 1.7 7.7 1.4 3.2 4.2 2.8 ...
## $ Dissolved.Oxygen..mg.l. : num 8.2 11.7 11.7 11.7 11.7 ...
## $ Orthophosphate..mg.l. : num 0.4 0.72 0.51 0.58 0.49 0.32 0.19 0.27 0.3 0.46 ...
## $ pH..ph.units. : num 8.6 7.3 6.8 7.3 7.4 7.3 6.2 7.9 7.3 7.3 ...
## $ Temperature..cel. : num 17 18.6 19.5 23.6 27.2 27.2 29.4 30.8 29 25.3 ...
## $ Nitrogen..mg.l. : num 0.21 0.13 0.32 0.11 0.28 0.19 0.36 0.26 0.33 0.35 ...
## $ Nitrate..mg.l. : num 0.48 0.16 0.65 0.32 0.68 0.5 0.7 0.33 0.38 0.25 ...
## $ CCME_Values : num 63.7 58.3 63.4 61.2 62.4 ...
## $ CCME_WQI : chr "Marginal" "Marginal" "Marginal" "Marginal" ...
Note the new Date format.
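The date conversion can be checked on a single value before applying it to a whole column. The sketch below parses the first date shown in the dataset and assumes the day-month-year layout described above.

```r
# Parse one date written as day-month-year
d <- as.Date("11-01-2001", format = "%d-%m-%Y")
d
class(d)

# Once parsed, parts of the date can be extracted with format()
format(d, "%Y")   # year
format(d, "%m")   # month

# Dates support arithmetic: number of days between two dates
d2 <- as.Date("12-02-2001", format = "%d-%m-%Y")
as.numeric(d2 - d)

# A string that does not match the format gives NA instead of an error
is.na(as.Date("2001/01/11", format = "%d-%m-%Y"))
```

Counting the NA values after a conversion, for example with `sum(is.na(...))`, is a simple way to confirm that the format string matches the data.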
Now, summarize the data to have a look at all variables.
summary(data)
## Country Area Waterbody.Type Date
## China:45997 Hou Bay:45997 Bay:45997 Min. :2001-02-01
## 1st Qu.:2005-02-11
## Median :2009-02-21
## Mean :2009-04-29
## 3rd Qu.:2014-02-01
## Max. :2017-02-28
## NA's :500
## Ammonia..mg.l. Biochemical.Oxygen.Demand..mg.l. Dissolved.Oxygen..mg.l.
## Min. : 0.0050 Min. : 0.1000 Min. : 0.000
## 1st Qu.: 0.0240 1st Qu.: 0.5000 1st Qu.: 6.000
## Median : 0.0460 Median : 0.7000 Median : 7.600
## Mean : 0.1007 Mean : 0.9106 Mean : 8.314
## 3rd Qu.: 0.1000 3rd Qu.: 1.1000 3rd Qu.:11.684
## Max. :10.0000 Max. :21.0000 Max. :16.100
##
## Orthophosphate..mg.l. pH..ph.units. Temperature..cel. Nitrogen..mg.l.
## Min. :0.00200 Min. :2.100 Min. :13.0 Min. :0.00200
## 1st Qu.:0.00600 1st Qu.:7.900 1st Qu.:19.8 1st Qu.:0.01100
## Median :0.01100 Median :8.000 Median :24.2 Median :0.01900
## Mean :0.01841 Mean :8.017 Mean :23.4 Mean :0.03261
## 3rd Qu.:0.02100 3rd Qu.:8.200 3rd Qu.:27.0 3rd Qu.:0.03400
## Max. :1.10000 Max. :9.300 Max. :33.2 Max. :1.10000
##
## Nitrate..mg.l. CCME_Values CCME_WQI
## Min. :0.0020 Min. : 51.08 Length:45997
## 1st Qu.:0.0260 1st Qu.: 93.18 Class :character
## Median :0.0770 Median :100.00 Mode :character
## Mean :0.1356 Mean : 96.51
## 3rd Qu.:0.1500 3rd Qu.:100.00
## Max. :5.9000 Max. :100.00
##
One of the most basic R skills is querying and manipulating data tables. Table manipulation is also something that is almost always required, regardless of what you decide to use R for. For beginners, practising table manipulation to meet different needs is a great way of improving R skills. If you wish to become really good at R, but don’t know where to start, start with tables!
The base R functions that come with the default R installation have
the capacity for almost all the table manipulation you will need (e.g.,
split(), subset(), apply(), sapply(), lapply(), tapply(), aggregate()).
However, sometimes their syntax is less user-friendly and intuitive
than that of the special packages built for table manipulation purposes.
So, here we are introducing a few of the most useful table manipulation
functions within the dplyr package. This is a package I use a
lot.
Note that you will have to use the install.packages() and
library() functions to download and activate the
dplyr package before using it.
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Now, we will see how different functions of this package work.
select()
We can use select() to select column(s) that meet a
specific pattern:
head(select(data, pH..ph.units.)) # select column called pH..ph.units.
## pH..ph.units.
## 1 8.6
## 2 7.3
## 3 6.8
## 4 7.3
## 5 7.4
## 6 7.3
filter()
Filter/select row(s) of data based on specific requirements for column(s) values:
head(filter(data, Temperature..cel. > 20)) # select rows that have a temperature higher than 20 C
## Country Area Waterbody.Type Date Ammonia..mg.l.
## 1 China Hou Bay Bay 2001-02-17 5.4
## 2 China Hou Bay Bay 2001-02-11 3.3
## 3 China Hou Bay Bay 2001-02-14 4.5
## 4 China Hou Bay Bay 2001-02-09 1.6
## 5 China Hou Bay Bay 2001-02-22 2.6
## 6 China Hou Bay Bay 2001-02-20 2.8
## Biochemical.Oxygen.Demand..mg.l. Dissolved.Oxygen..mg.l.
## 1 5.2 11.6841
## 2 1.7 11.6841
## 3 7.7 11.6841
## 4 1.4 4.5000
## 5 3.2 11.6841
## 6 4.2 1.8000
## Orthophosphate..mg.l. pH..ph.units. Temperature..cel. Nitrogen..mg.l.
## 1 0.58 7.3 23.6 0.11
## 2 0.49 7.4 27.2 0.28
## 3 0.32 7.3 27.2 0.19
## 4 0.19 6.2 29.4 0.36
## 5 0.27 7.9 30.8 0.26
## 6 0.30 7.3 29.0 0.33
## Nitrate..mg.l. CCME_Values CCME_WQI
## 1 0.32 61.18353 Marginal
## 2 0.68 62.39702 Marginal
## 3 0.50 58.85799 Marginal
## 4 0.70 73.17806 Fair
## 5 0.33 68.12720 Fair
## 6 0.38 67.02537 Fair
head(filter(data, Temperature..cel. > 25 & pH..ph.units. > 8)) # select rows that have a temperature higher than 25 C and a pH higher than 8
## Country Area Waterbody.Type Date Ammonia..mg.l.
## 1 China Hou Bay Bay 2001-02-01 0.130
## 2 China Hou Bay Bay 2001-02-01 0.120
## 3 China Hou Bay Bay 2001-02-01 0.140
## 4 China Hou Bay Bay 2001-02-04 0.220
## 5 China Hou Bay Bay 2001-02-04 0.160
## 6 China Hou Bay Bay 2001-02-04 0.074
## Biochemical.Oxygen.Demand..mg.l. Dissolved.Oxygen..mg.l.
## 1 0.9 11.6841
## 2 0.9 5.5000
## 3 0.6 5.0000
## 4 1.1 4.3000
## 5 1.0 4.6000
## 6 0.7 4.9000
## Orthophosphate..mg.l. pH..ph.units. Temperature..cel. Nitrogen..mg.l.
## 1 0.017 8.3 27.5 0.011
## 2 0.011 8.3 27.4 0.009
## 3 0.008 8.3 27.3 0.010
## 4 0.024 8.6 27.4 0.023
## 5 0.019 8.7 27.3 0.015
## 6 0.014 8.7 27.2 0.012
## Nitrate..mg.l. CCME_Values CCME_WQI
## 1 0.020 93.17895 Good
## 2 0.034 93.18025 Good
## 3 0.033 93.18150 Good
## 4 0.150 86.38163 Good
## 5 0.110 86.38019 Good
## 6 0.060 86.38094 Good
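A few more variations of select() and filter() are sketched below on a small made-up data frame, so that the whole result fits on screen. The column names `site`, `temp`, and `ph` are assumptions for illustration, not columns of the course dataset.

```r
library(dplyr)

# Small invented table
toy <- data.frame(
  site = c("A", "A", "B", "B"),
  temp = c(17.0, 23.6, 27.2, 30.8),
  ph   = c(8.6, 7.3, 7.4, 7.9)
)

# Select several columns by name, or drop one with a minus sign
sel_two <- select(toy, site, temp)
sel_drop <- select(toy, -ph)

# Conditions separated by commas must all be TRUE (same as using &)
both <- filter(toy, temp > 20, ph < 7.5)

# The | operator keeps rows where at least one condition is TRUE
either <- filter(toy, temp > 25 | ph > 8)

# %in% keeps rows whose value is in a set of allowed values
site_b <- filter(toy, site %in% c("B"))

both
either
site_b
```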
The pipe operator allows you to pipe the output from one function to
the input of the next function. Instead of nesting functions (reading
from the inside to the outside), the idea of piping is to read the
functions from left to right. It can also help you avoid creating and
saving a lot of intermediate variables that you don’t need to keep. The
older pipe operator is %>%, which comes from the magrittr package and is
made available when you load dplyr. Since R 4.1.0, R also has a built-in
(native) pipe, |>
# old operator
pipe_result<- data %>%
select(Temperature..cel.) %>%
head()
head(pipe_result)
## Temperature..cel.
## 1 17.0
## 2 18.6
## 3 19.5
## 4 23.6
## 5 27.2
## 6 27.2
# new operator
pipe_result<- data |>
select(Temperature..cel.) |>
head()
head(pipe_result)
## Temperature..cel.
## 1 17.0
## 2 18.6
## 3 19.5
## 4 23.6
## 5 27.2
## 6 27.2
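To make the comparison between nesting and piping concrete, the sketch below performs the same three steps both ways on a small made-up data frame and checks that the results match. The native pipe `|>` requires R 4.1.0 or later.

```r
library(dplyr)

# Small invented table
toy <- data.frame(
  site = c("A", "A", "B", "B"),
  temp = c(17.0, 23.6, 27.2, 30.8)
)

# Nested calls: read from the inside out
nested <- head(select(filter(toy, temp > 20), temp), 2)

# Piped calls: read from top to bottom
piped <- toy |>
  filter(temp > 20) |>
  select(temp) |>
  head(2)

# Both approaches give the same result
identical(nested, piped)
piped
```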
arrange()
This function arranges or re-orders rows based on their values. By default, the rows are arranged in ascending order.
order_data1<- data %>%
arrange(Temperature..cel.)
head(order_data1)
## Country Area Waterbody.Type Date Ammonia..mg.l.
## 1 China Hou Bay Bay 2004-02-06 5.900
## 2 China Hou Bay Bay 2004-02-09 0.015
## 3 China Hou Bay Bay 2008-02-22 0.013
## 4 China Hou Bay Bay 2004-02-06 4.100
## 5 China Hou Bay Bay 2008-02-22 0.010
## 6 China Hou Bay Bay 2008-02-15 0.054
## Biochemical.Oxygen.Demand..mg.l. Dissolved.Oxygen..mg.l.
## 1 4.3 2.5
## 2 4.2 8.1
## 3 1.0 8.8
## 4 3.5 3.2
## 5 0.9 8.7
## 6 0.9 8.8
## Orthophosphate..mg.l. pH..ph.units. Temperature..cel. Nitrogen..mg.l.
## 1 0.490 7.3 13.0 0.093
## 2 0.002 8.3 13.0 0.008
## 3 0.020 8.2 13.2 0.019
## 4 0.380 7.3 13.3 0.130
## 5 0.021 8.2 13.3 0.002
## 6 0.023 7.7 13.4 0.018
## Nitrate..mg.l. CCME_Values CCME_WQI
## 1 0.170 61.69006 Marginal
## 2 0.390 100.00000 Excellent
## 3 0.240 100.00000 Excellent
## 4 0.260 66.25210 Fair
## 5 0.240 100.00000 Excellent
## 6 0.019 100.00000 Excellent
order_data2<- data %>%
arrange(Temperature..cel., pH..ph.units.)
head(order_data2)
## Country Area Waterbody.Type Date Ammonia..mg.l.
## 1 China Hou Bay Bay 2004-02-06 5.900
## 2 China Hou Bay Bay 2004-02-09 0.015
## 3 China Hou Bay Bay 2008-02-22 0.013
## 4 China Hou Bay Bay 2004-02-06 4.100
## 5 China Hou Bay Bay 2008-02-22 0.010
## 6 China Hou Bay Bay 2008-02-15 0.054
## Biochemical.Oxygen.Demand..mg.l. Dissolved.Oxygen..mg.l.
## 1 4.3 2.5
## 2 4.2 8.1
## 3 1.0 8.8
## 4 3.5 3.2
## 5 0.9 8.7
## 6 0.9 8.8
## Orthophosphate..mg.l. pH..ph.units. Temperature..cel. Nitrogen..mg.l.
## 1 0.490 7.3 13.0 0.093
## 2 0.002 8.3 13.0 0.008
## 3 0.020 8.2 13.2 0.019
## 4 0.380 7.3 13.3 0.130
## 5 0.021 8.2 13.3 0.002
## 6 0.023 7.7 13.4 0.018
## Nitrate..mg.l. CCME_Values CCME_WQI
## 1 0.170 61.69006 Marginal
## 2 0.390 100.00000 Excellent
## 3 0.240 100.00000 Excellent
## 4 0.260 66.25210 Fair
## 5 0.240 100.00000 Excellent
## 6 0.019 100.00000 Excellent
# Now that we have learned the pipe operator, can you understand what order_data1 and order_data2 are producing?
Question: Can you arrange the table first by temperature and then by pH in descending order?
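For descending order, dplyr provides the helper desc(), which is wrapped around the column name inside arrange(). The sketch below shows it on a small made-up data frame; the column names are assumptions for illustration.

```r
library(dplyr)

# Small invented table
toy <- data.frame(
  site = c("A", "A", "B", "B"),
  temp = c(17.0, 23.6, 27.2, 30.8),
  ph   = c(8.6, 7.3, 7.4, 7.9)
)

# Highest temperature first
by_temp_desc <- toy %>%
  arrange(desc(temp))
by_temp_desc

# Ascending by site, then descending by pH within each site
by_site_ph <- toy %>%
  arrange(site, desc(ph))
by_site_ph
```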
mutate()
The mutate() command creates new column(s) and defines
their values. For instance, we can create a new column with just the
year the data was collected. Here we use the function
format() and specify that we want the year with
"%Y":
new_col<- data %>%
mutate(Year = format(Date, "%Y"))
head(new_col)
## Country Area Waterbody.Type Date Ammonia..mg.l.
## 1 China Hou Bay Bay 2001-02-11 3.5
## 2 China Hou Bay Bay 2001-02-12 6.7
## 3 China Hou Bay Bay 2001-02-14 4.5
## 4 China Hou Bay Bay 2001-02-17 5.4
## 5 China Hou Bay Bay 2001-02-11 3.3
## 6 China Hou Bay Bay 2001-02-14 4.5
## Biochemical.Oxygen.Demand..mg.l. Dissolved.Oxygen..mg.l.
## 1 0.7 8.2000
## 2 2.1 11.6841
## 3 0.3 11.6841
## 4 5.2 11.6841
## 5 1.7 11.6841
## 6 7.7 11.6841
## Orthophosphate..mg.l. pH..ph.units. Temperature..cel. Nitrogen..mg.l.
## 1 0.40 8.6 17.0 0.21
## 2 0.72 7.3 18.6 0.13
## 3 0.51 6.8 19.5 0.32
## 4 0.58 7.3 23.6 0.11
## 5 0.49 7.4 27.2 0.28
## 6 0.32 7.3 27.2 0.19
## Nitrate..mg.l. CCME_Values CCME_WQI Year
## 1 0.48 63.68388 Marginal 2001
## 2 0.16 58.30881 Marginal 2001
## 3 0.65 63.36902 Marginal 2001
## 4 0.32 61.18353 Marginal 2001
## 5 0.68 62.39702 Marginal 2001
## 6 0.50 58.85799 Marginal 2001
Can you create a new column called zero and give it a value of 0?
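mutate() can create several columns in one call, either from a constant or from a calculation on existing columns. The sketch below uses a small made-up data frame; the new column names are assumptions for illustration.

```r
library(dplyr)

# Small invented table with a date column
toy <- data.frame(
  date = as.Date(c("2001-01-11", "2001-02-12", "2002-03-14")),
  temp = c(17.0, 18.6, 19.5)
)

toy_new <- toy %>%
  mutate(
    country = "China",                        # same value in every row
    temp_f  = temp * 9 / 5 + 32,              # calculated from another column
    Month   = as.integer(format(date, "%m"))  # month extracted from the date
  )
toy_new
```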
summarise()
This function calculates summary statistics across all rows, or across rows
within certain groups. It is often used in combination with
group_by().
sum_table <- data %>%
summarise(mean(pH..ph.units.))
sum_table
## mean(pH..ph.units.)
## 1 8.016912
sum_table2 <- data%>%
summarise(avg_PH= mean(pH..ph.units.), min_PH= min(pH..ph.units.), max_PH= max(pH..ph.units.))
sum_table2
## avg_PH min_PH max_PH
## 1 8.016912 2.1 9.3
group_by()
This is a great function. group_by() divides data rows
into groups based on the grouping column(s) provided. It is often used in
combination with other functions that define what you do with the rows
after placing them in groups. When group_by() and
summarise() are used together, you are essentially telling
R to separate rows into different groups, and for each group you use
summarise() to generate a series of summary statistics that
characterize the column values.
Let’s calculate the mean, min and max pH per year:
group_summary <- new_col |>
group_by(Year) |>
summarise(avg_PH= mean(pH..ph.units.), min_PH= min(pH..ph.units.), max_PH= max(pH..ph.units.))
group_summary
## # A tibble: 18 × 4
## Year avg_PH min_PH max_PH
## <chr> <dbl> <dbl> <dbl>
## 1 2001 8.20 4.1 9.3
## 2 2002 8.07 6.7 8.8
## 3 2003 8.17 7.1 8.7
## 4 2004 8.12 7.2 8.9
## 5 2005 8.14 7 9
## 6 2006 8.01 6.8 8.7
## 7 2007 8.08 2.1 9.3
## 8 2008 8.15 5.9 8.9
## 9 2009 8.05 6.9 8.8
## 10 2010 7.96 7.1 8.9
## 11 2011 7.93 7 8.6
## 12 2012 7.81 6.1 8.4
## 13 2013 8.02 7 8.7
## 14 2014 7.95 7.1 8.7
## 15 2015 7.92 6.8 8.7
## 16 2016 7.86 6.5 8.6
## 17 2017 7.91 6.8 8.7
## 18 <NA> 8.13 7.2 8.8
Very cool right!!??
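Two details are worth practicing when you combine group_by() and summarise(): counting the rows in each group with n(), and skipping missing values with na.rm = TRUE. The sketch below uses a small made-up table (the object toy and its values are hypothetical, not taken from the water quality dataset), so you can check the results by hand.

```r
library(dplyr)

# Hypothetical toy data standing in for the water quality table
toy <- data.frame(
  Year = c("2001", "2001", "2002", "2002", "2002"),
  pH   = c(8.0, 8.4, 7.5, NA, 7.9)
)

toy_summary <- toy |>
  group_by(Year) |>
  summarise(
    n_obs  = n(),                     # rows per group, NA rows included
    avg_pH = mean(pH, na.rm = TRUE),  # ignore NA values when averaging
    .groups = "drop"                  # return an ungrouped table
  )

toy_summary
```

Without na.rm = TRUE, the mean for 2002 would be NA, because a single missing value makes the whole average missing.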
This has been a glimpse of what can be done in R to work with tabular data. There are plenty of other packages that you will learn in time by searching online and learning from other people, but for now, the functions we covered are a very good set of tools to do most of what you will need.
The package ggplot2 is widely used for data
visualization in R. This package is extremely powerful. I used to like
using base R code for plotting, but eventually I had to admit that
ggplot2 is extremely cool and adopted it.
ggplot2 is based on the grammar of graphics, which allows
users to create complex plots from data in a systematic way.
As with any R package, before you can use ggplot2, you need to install it (if you haven’t already) and load it into your R session.
#install.packages("ggplot2")
library(ggplot2)
There are a few concepts that we need to know to understand how to code ggplots.
Data: The dataset you want to visualize.
Aesthetics (aes): The visual properties of the plot (e.g., x and y position, color, size).
Geometries (geom_): The actual marks we put on the plot (e.g., points, lines, bars). We can, for instance, use geom_line() for plotting lines.
Facets: Subplots that display subsets of the data.
Scales: Control how data values are mapped to visual properties.
Themes: Control the overall appearance of the plot (e.g., font size, background color).
Let’s start with a simple scatter plot using the dataset we have been working with.
### A basic scatter plot
First, let’s filter the data to keep only the year 2015. Let’s review some data manipulation steps:
# Convert the Date column to Date class (%d = day, %m = month, %Y = year)
data$Date <- as.Date(data$Date, format = "%d-%m-%Y")
# Create a Year column and call the new object data2
data2 <- data %>%
  mutate(Year = format(Date, "%Y"))
# Keep only 2015 (Year is stored as text, so compare against "2015")
data3 <- data2 |> filter(Year == "2015")
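The format codes used by as.Date() and format() are case sensitive and easy to mix up. As a minimal sketch, assuming dates written as day-month-year with dashes (the vector raw_dates is a made-up example, not part of the dataset):

```r
# A few hypothetical day-month-year strings, as they might appear in a CSV
raw_dates <- c("11-02-2001", "25-12-2015")

# %d = day of month, %m = month as a number, %Y = four-digit year
# (%M means minutes, so it is not the right code for months)
parsed <- as.Date(raw_dates, format = "%d-%m-%Y")
parsed

# format() returns text, so the extracted year is a character string
years <- format(parsed, "%Y")
class(years)

# Convert to a number if you need arithmetic or numeric comparisons
as.integer(years)
```

It is worth checking the range of the parsed dates (for example with summary()) after any conversion, since a wrong format code can give plausible-looking but incorrect dates instead of an error.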
Now, we can create our first plot. Let’s plot dissolved oxygen against temperature:
ggplot(data = data3, aes(x = Temperature..cel., y = Dissolved.Oxygen..mg.l.)) +
geom_point()
Here,
ggplot(data = data3, aes(x = Temperature..cel., y = Dissolved.Oxygen..mg.l.))
initializes the plot with the data3 dataset, setting
Temperature..cel. on the x-axis and
Dissolved.Oxygen..mg.l. on the y-axis. geom_point() adds
points to the plot to create a scatter plot.
We can see that as temperature increases, the amount of oxygen dissolved in the water decreases.
We can also see a horizontal line of dots, which hints that something may be wrong with those measurements. For the purpose of this practical, we can ignore that.
### Titles and labels
We now have a basic plot. Let’s start to customize it. We will first
add a title, subtitle, and axis labels with the labs()
function.
ggplot(data = data3, aes(x = Temperature..cel., y = Dissolved.Oxygen..mg.l.)) +
geom_point() +
labs(title = "Scatter Plot of temp vs diss. oxygen",
x = "Temperature (C)",
y = "Dissolved oxygen (mg/l)")
We can also change the point color using col = "red".
Try other colors, like “blue”, or “green”.
ggplot(data = data3, aes(x = Temperature..cel., y = Dissolved.Oxygen..mg.l.)) +
geom_point(col = "red") +
labs(title = "Scatter Plot of temp vs diss. oxygen",
x = "Temperature (C)",
y = "Dissolved oxygen (mg/l)")
Looking pretty good.
What if we want to add a trend line?
We can use the geom_smooth() function to add a linear
trend. The method argument sets the smoothing method (function); here we use a
linear model ("lm"). se controls whether the confidence band around the line is drawn, and color
controls the color of the line.
ggplot(data = data3, aes(x = Temperature..cel., y = Dissolved.Oxygen..mg.l.)) +
geom_point(col = "red") +
labs(title = "Scatter Plot of temp vs diss. oxygen",
x = "Temperature (C)",
y = "Dissolved oxygen (mg/l)") +
geom_smooth(method = "lm", se = TRUE, color = "blue")
## `geom_smooth()` using formula = 'y ~ x'
We now can clearly see the negative relationship.
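The line drawn by geom_smooth(method = "lm") is an ordinary linear regression, so you can get its slope and intercept as numbers with lm(). A minimal sketch on made-up readings (the data frame toy and its values are hypothetical):

```r
# Hypothetical temperature (C) and dissolved oxygen (mg/l) readings
toy <- data.frame(
  temp   = c(15, 20, 25, 30),
  oxygen = c(10, 9, 8, 7)
)

# Fit oxygen as a linear function of temperature
fit <- lm(oxygen ~ temp, data = toy)

# Intercept and slope of the fitted line
coef(fit)

# The slope is the change in oxygen per degree of temperature
slope <- unname(coef(fit)["temp"])
slope
```

In this toy example oxygen drops by 0.2 mg/l for every additional degree. With the real data you would replace toy, temp and oxygen with data3 and its column names.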
Let’s now try different geometries.
For a histogram, we use geom_histogram().
ggplot(data = data3, aes(x = Temperature..cel.)) +
geom_histogram(binwidth = 2, fill = "blue", color = "black") +
labs(title = "Histogram of water temperature",
x = "Temperature (C)",
y = "Count")
We use geom_boxplot() for a boxplot.
This can be useful to visualize the distribution of a variable across different categories. For instance, we can look at the concentration of nitrogen across the different water quality index categories:
ggplot(data = data3, aes(x = factor(CCME_WQI), y = Nitrogen..mg.l.)) +
geom_boxplot() +
labs(x = "Water Quality Index",
y = "Nitrogen (mg/l)")
Themes control the overall look of your plot. There
are many themes available. For instance, we can use
theme_minimal().
ggplot(data = data3, aes(x = factor(CCME_WQI), y = Nitrogen..mg.l.)) +
geom_boxplot() +
labs(x = "Water Quality Index",
y = "Nitrogen (mg/l)") +
theme_minimal()
Other available themes include theme_gray(),
theme_classic(), theme_bw(), and more. Check
them out yourself.
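Facets and scales are listed among the ggplot2 concepts but are not used in the plots so far. The sketch below shows one possible use of each on the built-in mtcars dataset (chosen here only because it ships with R; the palette and labels are arbitrary choices):

```r
library(ggplot2)

# mtcars ships with R; cyl is the number of cylinders (4, 6 or 8)
p_facets <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(gear))) +
  geom_point() +
  facet_wrap(~ cyl) +                      # one panel per value of cyl
  scale_color_brewer(palette = "Dark2") +  # choose the colors used for gear
  labs(x = "Weight (1000 lbs)",
       y = "Miles per gallon",
       color = "Gears") +
  theme_bw()

p_facets
```

With the water quality data, the same idea could be applied by faceting on a grouping column such as CCME_WQI.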
With ggplot2, you have a powerful tool to explore and present your data in compelling ways.
There are many options that you can control in ggplot2. The best way to learn all the possibilities is by playing with it. Pretty much everything can be customized. Feel free to experiment with different datasets and ggplot2 functions to create the visualizations that best communicate your insights!
To save your plot to a file, use the function ggsave().
It is good practice to first store your plot as an object and pass it to ggsave(); otherwise, ggsave() saves the last plot that was displayed.
p <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_point() +
labs(title = "Scatter Plot of MPG vs Weight")
ggsave("scatter_plot.png", plot = p, width = 6, height = 4)
Finally, we want to explore how different parameters change over time.
For this, we need to combine everything we have learned: first create the variables we need, and then plot them. A full analysis.
First, let’s look at the data2 dataframe we created by
using the summary() function:
summary(data2)
## Country Area Waterbody.Type Date
## Length:45997 Length:45997 Length:45997 Min. :2001-02-01
## Class :character Class :character Class :character 1st Qu.:2005-02-11
## Mode :character Mode :character Mode :character Median :2009-02-21
## Mean :2009-04-29
## 3rd Qu.:2014-02-01
## Max. :2017-02-28
## NA's :500
## Ammonia..mg.l. Biochemical.Oxygen.Demand..mg.l. Dissolved.Oxygen..mg.l.
## Min. : 0.0050 Min. : 0.1000 Min. : 0.000
## 1st Qu.: 0.0240 1st Qu.: 0.5000 1st Qu.: 6.000
## Median : 0.0460 Median : 0.7000 Median : 7.600
## Mean : 0.1007 Mean : 0.9106 Mean : 8.314
## 3rd Qu.: 0.1000 3rd Qu.: 1.1000 3rd Qu.:11.684
## Max. :10.0000 Max. :21.0000 Max. :16.100
##
## Orthophosphate..mg.l. pH..ph.units. Temperature..cel. Nitrogen..mg.l.
## Min. :0.00200 Min. :2.100 Min. :13.0 Min. :0.00200
## 1st Qu.:0.00600 1st Qu.:7.900 1st Qu.:19.8 1st Qu.:0.01100
## Median :0.01100 Median :8.000 Median :24.2 Median :0.01900
## Mean :0.01841 Mean :8.017 Mean :23.4 Mean :0.03261
## 3rd Qu.:0.02100 3rd Qu.:8.200 3rd Qu.:27.0 3rd Qu.:0.03400
## Max. :1.10000 Max. :9.300 Max. :33.2 Max. :1.10000
##
## Nitrate..mg.l. CCME_Values CCME_WQI Year
## Min. :0.0020 Min. : 51.08 Length:45997 Length:45997
## 1st Qu.:0.0260 1st Qu.: 93.18 Class :character Class :character
## Median :0.0770 Median :100.00 Mode :character Mode :character
## Mean :0.1356 Mean : 96.51
## 3rd Qu.:0.1500 3rd Qu.:100.00
## Max. :5.9000 Max. :100.00
##
Notice that the Date column has 500 missing values (NA's). That will
create problems, so we need to get rid of those rows. For that, we can
use the function filter() and keep only the rows without
missing dates using complete.cases(Date):
nrow(data2)
## [1] 45997
data2 <- data2 |> filter(complete.cases(Date))
nrow(data2)
## [1] 45497
Note, we have removed the 500 rows with NAs.
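An equivalent and common way to drop rows with a missing value in a single column is filter(!is.na(...)). A small sketch on a made-up table (toy is hypothetical) that also shows how to count the missing values before removing them:

```r
library(dplyr)

# Hypothetical toy table with one missing date
toy <- data.frame(
  Date = as.Date(c("2001-02-11", NA, "2001-02-14")),
  pH   = c(8.6, 7.3, 6.8)
)

# How many dates are missing?
sum(is.na(toy$Date))

# Keep only the rows where Date is not missing
toy_clean <- toy |> filter(!is.na(Date))
nrow(toy_clean)
```

Counting the NAs first lets you confirm that the number of rows removed matches what you expect.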
Now, we can create a new dataframe with the average value per year across all years:
China <- data2 |>
mutate(Year = format(Date, "%Y")) |>
group_by(Year) %>%
summarise(
Ammonia = mean(Ammonia..mg.l., na.rm = TRUE),
Dissolved.Oxygen = mean(Dissolved.Oxygen..mg.l., na.rm = TRUE),
pH = mean(pH..ph.units., na.rm = TRUE),
Temperature = mean(Temperature..cel., na.rm = TRUE),
Nitrogen = mean(Nitrogen..mg.l., na.rm = TRUE),
Nitrate = mean(Nitrate..mg.l., na.rm = TRUE),
.groups = "drop"
)
China$Year <- as.Date(paste0(China$Year, "-01-01")) # Convert year back to a date (1 January of each year)
Finally, we can create time series plots to see how these parameters have changed across years:
ggplot(China, aes(x = Year, y = Ammonia)) + geom_point() + geom_line(aes(group = 1)) + geom_smooth(method = "gam", se = TRUE, color = "blue")
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
ggplot(China, aes(x = Year, y = Dissolved.Oxygen)) + geom_point() + geom_line(aes(group = 1)) + geom_smooth(method = "gam", se = TRUE, color = "blue")
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
ggplot(China, aes(x = Year, y = pH)) + geom_point() + geom_line(aes(group = 1)) + geom_smooth(method = "gam", se = TRUE, color = "blue")
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
ggplot(China, aes(x = Year, y = Temperature)) + geom_point() + geom_line(aes(group = 1)) + geom_smooth(method = "gam", se = TRUE, color = "blue")
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
ggplot(China, aes(x = Year, y = Nitrogen)) + geom_point() + geom_line(aes(group = 1)) + geom_smooth(method = "gam", se = TRUE, color = "blue")
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
ggplot(China, aes(x = Year, y = Nitrate)) + geom_point() + geom_line(aes(group = 1)) + geom_smooth(method = "gam", se = TRUE, color = "blue")
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
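The six plots above repeat the same code with a different y variable. One alternative, sketched here under the assumption that the tidyr package is installed, is to reshape the yearly table into long format with pivot_longer() and let facet_wrap() draw one panel per parameter. The table yearly and its values are made up for illustration; with the real data you would start from China instead.

```r
library(tidyr)
library(ggplot2)

# Hypothetical yearly means standing in for the China table
yearly <- data.frame(
  Year        = as.Date(c("2001-01-01", "2002-01-01", "2003-01-01")),
  pH          = c(8.2, 8.1, 8.2),
  Temperature = c(23.1, 23.6, 23.4)
)

# Long format: one row per year and parameter
yearly_long <- yearly |>
  pivot_longer(cols = -Year, names_to = "Parameter", values_to = "Value")

# One panel per parameter, each with its own y-axis range
p_trends <- ggplot(yearly_long, aes(x = Year, y = Value)) +
  geom_point() +
  geom_line() +
  facet_wrap(~ Parameter, scales = "free_y")

p_trends
```

Using scales = "free_y" matters here because the parameters are measured in different units and ranges.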
To be updated before class.